What is text clustering?
What are the applications?
How to cluster text data?
Discover “natural structure” of data
Clustering - the process of grouping a set of objects into clusters of similar objects
Basic criteria
high intra-cluster similarity
low inter-cluster similarity
No (little) supervision signal about the underlying clustering structure
Need similarity/distance as guidance to form clusters
Organize document collections
Grouping search results
Organize documents by topics
Facilitate user browsing
Basic properties
Minkowski metric
\(d(x,y) = \sqrt[p]{\sum^V_{i=1}{|x_i-y_i|^p}}\)
Cosine metric
\(d(x,y) = 1 - \text{cosine}(x,y)\)
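Both metrics can be sketched in a few lines of NumPy (a minimal illustration; the function names and example vectors are mine, not from the slides):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p: p=2 gives Euclidean, p=1 Manhattan."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_distance(x, y):
    """1 - cosine similarity: 0 for vectors pointing the same way."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 0.0, 1.0])
y = np.array([0.0, 1.0, 1.0])
print(minkowski(x, y, 2))     # Euclidean distance: sqrt(2)
print(cosine_distance(x, y))  # 0.5
```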
Edit distance
Count the minimum number of operations required to transform one string into the other
← Can be efficiently solved by dynamic programming
Extend to distance between sentences
Word similarity as cost of replacement
“terrible” -> “bad”: low cost → Lexicon or distributional semantics
“terrible” -> “terrific”: high cost → Lexicon or distributional semantics
Preserving word order in distance computation
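The dynamic program above, extended from characters to sentences by letting the substitution cost reflect word similarity, can be sketched as follows (the synonym table is a toy stand-in for a lexicon or distributional semantics):

```python
def edit_distance(s, t, sub_cost=lambda a, b: 0 if a == b else 1):
    """Levenshtein distance via dynamic programming.
    s and t may be strings (character edits) or token lists (word edits);
    sub_cost lets word similarity set the replacement cost."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1,   # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3

# Word-level distance: a synonym pair gets a low replacement cost.
synonyms = {("terrible", "bad"), ("bad", "terrible")}
cost = lambda a, b: 0 if a == b else (0.2 if (a, b) in synonyms else 1)
print(edit_distance("the movie was terrible".split(),
                    "the movie was bad".split(), cost))  # 0.2
```

Because the DP walks both sequences left to right, word order is preserved in the distance computation, as the slide requires.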
Hard clustering: Each document belongs to exactly one cluster
Soft clustering: A document can belong to more than one cluster.
Makes more sense for applications like creating browsable hierarchies
You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
You can only do that with a soft clustering approach.
Partitioning method: Construct a partition of \(n\) documents into a set of \(K\) clusters
Given: a set of documents and the number \(K\)
Find: a partition of \(K\) clusters that optimizes the chosen partitioning criterion
Typical partitional clustering algorithms
k-means clustering
Gaussian Mixture Model
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, \(c\):
\[\vec \mu(c)=\frac{1}{|c|}\sum_{\vec x \in c}{\vec x}\]
Reassignment of instances to clusters is based on distance to the current cluster centroids.
Select \(K\) random docs \(\{s_1, s_2,… s_K\}\) as seeds.
Until clustering converges (or other stopping criterion):
For each doc \(d_i\): assign \(d_i\) to the cluster \(c_j\) whose seed \(s_j\) is closest.
(Next, update the seeds to the centroid of each cluster)
For each cluster \(c_j\): \(s_j = \vec\mu(c_j)\)
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
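The loop above can be sketched in NumPy as follows (a minimal version; the empty-cluster guard and the "centroids don't change" stopping test are my additions, and the toy data is illustrative):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Seed with K random points, then alternate assignment and
    centroid recomputation until centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest current centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of its cluster.
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # keep old centroid if a cluster empties
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
labels, centroids = kmeans(X, K=2)
```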
Why should the K-means algorithm ever reach a fixed point?
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
EM is known to converge.
Number of iterations could be large.
Results can vary based on random seed selection.
Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.
Select good seeds using a heuristic (e.g., doc least similar to any existing mean)
Try out multiple starting points
Initialize with the results of another method.
Recomputing the centroid after every assignment (rather than after all points are re-assigned) can improve speed of convergence of K-means
Assumes clusters are spherical in vector space
Disjoint and exhaustive
Doesn’t have a notion of “outliers” by default
But can add outlier filtering
Typical hierarchical clustering algorithms
Bottom-up agglomerative clustering
Top-down divisive clustering
Start with all data as one cluster
Repeatedly split the remaining clusters into two
Starts with each doc in a separate cluster
The history of merging forms a binary tree or hierarchy.
Many variants to defining closest pair of clusters
\[sim(c_i, c_j) = \max_{x \in c_i, y\in c_j}{sim(x,y)}\]
Can result in “straggly” (long and thin) clusters due to chaining effect.
After merging \(c_i\) and \(c_j\), the similarity of the resulting cluster to another cluster, \(c_k\), is:
\[sim((c_i \cup c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k))\]
\[sim(c_i, c_j) = \min_{x \in c_i, y\in c_j}{sim(x,y)}\]
Makes “tighter,” spherical clusters that are typically preferable.
After merging \(c_i\) and \(c_j\), the similarity of the resulting cluster to another cluster, \(c_k\), is:
\[sim((c_i \cup c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k))\]
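Both linkage criteria are available off the shelf; a small SciPy sketch (the 1-D toy data is my own choosing):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three nearby points and two distant ones.
X = np.array([[0.0], [0.2], [0.4], [5.0], [5.2]])

# linkage records the merge history (the binary tree / dendrogram);
# "single" uses closest-pair similarity, "complete" farthest-pair.
Z_single = linkage(X, method="single")
Z_complete = linkage(X, method="complete")

# Cut each tree into 2 clusters.
print(fcluster(Z_single, t=2, criterion="maxclust"))
print(fcluster(Z_complete, t=2, criterion="maxclust"))
```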
Three concepts: words, topics, and documents
Documents are a collection of words and have a probability distribution over topics
Topics have a probability distribution over words
Model:
Treat data as observations that arise from a generative probabilistic process that includes hidden variables: For documents, the hidden variables reflect the thematic structure of the collection.
Infer the hidden structure using posterior inference: What are the topics that describe this collection?
Situate new data into the estimated model: How does this query or new document fit into the estimated topic structure?
| What is Latent Dirichlet Allocation (LDA)? | A generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of latent topics. Each observed word originates from a topic that we do not directly observe. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. |
| What is used for? | The fitted model can be used to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables which are referred to as topics. |
| How is it related to text mining and other machine learning techniques? | Topic models can be seen as classical text mining or natural language processing tools; fitting them amounts to modeling text corpora and other collections of discrete data. One advantage of LDA over related latent variable models such as LSI (which uses a singular value decomposition) is that it provides well-defined inference procedures for previously unseen documents. |
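Fitting such a model is a few lines with scikit-learn (a sketch on a made-up four-document corpus; the hyperparameters are illustrative, not recommendations):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the game was a great match with a great score",
        "the election vote went to the senate",
        "the team won the match and the score",
        "the senate passed the vote on the bill"]

# Bag-of-words counts, then a 2-topic LDA fit.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# theta: per-document topic proportions; each row is a distribution.
theta = lda.transform(counts)
print(theta.shape)  # (4, 2)
```

New documents can be situated in the estimated topic structure by calling `transform` on their count vectors, matching the "posterior inference on new data" step above.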
\[p(\theta|\vec \alpha) = \frac{\Gamma(\sum_i{\alpha_i})}{\prod_i{\Gamma(\alpha_i)}}\prod_i{\theta_i^{\alpha_i - 1}}\]
The Dirichlet is conjugate to the multinomial. Given a multinomial observation, the posterior distribution of \(\theta\) is a Dirichlet.
The parameter \(\alpha\) controls the mean shape and sparsity of \(\theta\); it is a \(k\)-vector with components \(\alpha_i > 0\).
The topic proportions are a K dimensional Dirichlet. The topics are a V dimensional Dirichlet.
As we draw random variables from this Dirichlet, each draw \(\theta\) is itself a distribution over 3 elements.
\(\theta \sim \text{Dirichlet}(1,1,1)\), i.e., \(\alpha_1 = \alpha_2 = \alpha_3 = 1\): the uniform distribution over the simplex, as an example
Dirichlet is parameterized by \(\alpha\); as \(\alpha\) increases, the density becomes more peaked around its mean.
When \(\alpha_i < 1\) (scale \(s = \sum_i \alpha_i < k\)), you get sparsity: on the 3-simplex, probability mass concentrates at the corners.
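This behavior is easy to see by sampling (NumPy sketch; the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# alpha = (1, 1, 1): uniform over the 3-simplex.
uniform_draws = rng.dirichlet([1.0, 1.0, 1.0], size=5)

# alpha components < 1: sparse draws near the simplex corners,
# so each draw tends to concentrate on one or two components.
sparse_draws = rng.dirichlet([0.1, 0.1, 0.1], size=5)

# Each draw is itself a probability distribution over 3 elements.
print(uniform_draws.sum(axis=1))  # all 1.0
```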
Scalability
Ability to deal with various types of data
No/less assumption about input data
Minimal requirement about domain knowledge
Interpretability and usability
Criteria to determine whether the clusters are meaningful
Internal validation
External validation
Coherence
Inter-cluster similarity vs. intra-cluster similarity
Davies–Bouldin index
\(DB = \frac{1}{k}\sum_{i=1}^k{\underset{j \neq i}{\operatorname{max}}{(\frac{\sigma_i + \sigma_j}{d(c_i,c_j)})}}\) ← Evaluate every pair of clusters
We prefer smaller DB-index!
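A direct NumPy transcription of the formula, with \(\sigma_i\) taken as the mean distance of cluster \(i\)'s points to its centroid (the toy data is my own):

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB index: average, over clusters, of the worst-case ratio
    (sigma_i + sigma_j) / d(c_i, c_j)."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    sigma = np.array([np.linalg.norm(X[labels == k] - cents[i], axis=1).mean()
                      for i, k in enumerate(ks)])
    total = 0.0
    for i in range(len(ks)):
        total += max((sigma[i] + sigma[j]) / np.linalg.norm(cents[i] - cents[j])
                     for j in range(len(ks)) if j != i)
    return total / len(ks)

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(davies_bouldin(X, np.array([0, 0, 1, 1])))  # small: tight, far-apart clusters
```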
Coherence
Inter-cluster similarity vs. intra-cluster similarity
Dunn index
\(D = \frac{ \underset{1 \leq i < j \leq k}{\operatorname{min}}d(c_i, c_j)}{\underset{1 \leq i \leq k}{\operatorname{max}} \sigma_i}\) We prefer larger D-index!
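The Dunn index can be transcribed the same way, with \(d(c_i, c_j)\) as centroid distance and \(\sigma_i\) as the mean distance of cluster \(i\)'s points to its centroid (a sketch; larger is better, the opposite of the DB index):

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index: smallest inter-centroid distance divided by the
    largest cluster spread (mean distance of points to their centroid)."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    sigma = np.array([np.linalg.norm(X[labels == k] - cents[i], axis=1).mean()
                      for i, k in enumerate(ks)])
    inter = min(np.linalg.norm(cents[i] - cents[j])
                for i in range(len(ks)) for j in range(i + 1, len(ks)))
    return inter / sigma.max()

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(dunn_index(X, np.array([0, 0, 1, 1])))  # large: well-separated clusters
```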
Limitation
No indication of actual application’s performance
Bias towards a specific type of clustering algorithm if that algorithm is designed to optimize a similar metric
Given class label \(\Omega\) on each instance (required; may incur extra annotation cost)
Purity: correctly clustered documents in each cluster
\[purity(\Omega, C) = \frac{1}{17}(5 + 4 + 3) \approx 0.71\]
Given class label \(\Omega\) on each instance
Normalized mutual information (NMI)
Given class label \(\Omega\) on each instance
Rand index
Idea: we want to assign two documents to the same cluster if and only if they are from the same class
\(RI = \frac{TP + TN}{TP + FP + FN + TN}\) ← Essentially it is like classification accuracy
Given class label \(\Omega\) on each instance
Precision/Recall/F-measure
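Two of these metrics written out directly (a sketch assuming integer class and cluster ids; the toy labels are my own):

```python
import numpy as np
from itertools import combinations

def purity(classes, clusters):
    """Fraction of documents falling in the majority class of their cluster."""
    total = 0
    for k in np.unique(clusters):
        total += np.bincount(classes[clusters == k]).max()
    return total / len(classes)

def rand_index(classes, clusters):
    """Pairwise agreement: a pair counts as correct (TP or TN) when
    same-cluster matches same-class."""
    pairs = list(combinations(range(len(classes)), 2))
    agree = sum((classes[i] == classes[j]) == (clusters[i] == clusters[j])
                for i, j in pairs)
    return agree / len(pairs)

classes = np.array([0, 0, 1, 1])
print(purity(classes, np.array([0, 0, 1, 1])))      # 1.0: perfect clustering
print(rand_index(classes, np.array([0, 1, 0, 1])))
```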
\[sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|(|c_i \cup c_j| - 1)}\sum_{\vec x \in (c_i \cup c_j)} \sum_{\vec y \in (c_i \cup c_j): \vec y \neq \vec x}{sim(\vec x, \vec y)}\]
Compromise between single and complete link.
Two options:
Averaged across all ordered pairs in the merged cluster
Averaged over all pairs between the two original clusters
No clear difference in efficacy
Compute similarity of clusters in constant time by maintaining sum vectors: \[\vec s(c_j) = \sum_{\vec x \in c_j}{\vec x}\]
\[sim(c_i, c_j) = \frac{(\vec s(c_i) + \vec s(c_j)) \cdot (\vec s(c_i) + \vec s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}\]
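The constant-time sum-vector trick can be checked numerically: with unit-length vectors and dot-product similarity, squaring the merged sum vector counts every ordered pair once, plus the self-similarities that the numerator subtracts (a sketch on random data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two clusters of unit-length vectors (dot product = cosine similarity).
A = rng.normal(size=(4, 5)); A /= np.linalg.norm(A, axis=1, keepdims=True)
B = rng.normal(size=(3, 5)); B /= np.linalg.norm(B, axis=1, keepdims=True)
n = len(A) + len(B)

# Constant-time formula from precomputed sum vectors.
s = A.sum(axis=0) + B.sum(axis=0)
fast = (s @ s - n) / (n * (n - 1))

# Direct average of sim(x, y) over all ordered pairs x != y in the merge.
M = np.vstack([A, B])
S = M @ M.T
direct = (S.sum() - np.trace(S)) / (n * (n - 1))

print(np.isclose(fast, direct))  # True
```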
Internal criterion: A good clustering will produce high quality clusters in which:
the intra-class (that is, intra-cluster) similarity is high
the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used
Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data
Assesses a clustering with respect to ground truth … requires labeled data
Assume documents with \(C\) gold standard classes, while our clustering algorithms produce \(K\) clusters, \(\omega_1\), \(\omega_2\), …, \(\omega_K\) with \(n_i\) members.
\[Purity(\omega_i) = \frac{1}{n_i}\max_{j \in C}(n_{ij})\] where \(n_{ij}\) is the number of documents of class \(j\) in cluster \(\omega_i\)
Biased because having \(n\) singleton clusters (one document each) maximizes purity
Others are entropy of classes in clusters (or mutual information between classes and clusters)
Text clustering
In clustering, clusters are inferred from the data without human input (unsupervised learning)
However, in practice, it’s a bit less clear: there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents
Evaluation